Exploratory Data Analysis of Joined Algae Clade & Bacteria in Coral Samples

read_csv("../output/coral_data.csv") -> corals
## Rows: 658 Columns: 5897
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr    (6): sample_id, Clade, Majority, species, region, reef
## dbl (5891): its2_count, ITS2_f, ASV0001, ASV0002, ASV0003, ASV0004, ASV0005,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#ITS2_f should be a factor
corals %>% 
  mutate(ITS2_f = as.factor(ITS2_f)) -> corals
dim(corals)
## [1]  658 5897

What are the proportions of observations per clade?

corals %>% 
  group_by(Clade) %>% 
  summarise(n = n()) 
## # A tibble: 2 × 2
##   Clade     n
##   <chr> <int>
## 1 A       606
## 2 C        52
  • Clade A has 606 observations. 606/(606+52) = about 92 %

  • Clade C has 52 observations. 52/(606+52) = about 8 %

Notes: If we are trying to explain the variability of clades based on bacteria, logistic regression would be used to explain or predict either clade A or C with bacteria ASVs as columns; however, taking out other info (such as region) is not advisable as they are all possible predictors.

  • How correlated are the top 2 ASVs with each of the algae variables?

  • Logging the ASV counts presents better visualization:

corals %>% 
  ggplot(aes(x = log(ASV0001), y = log(ASV0002), color = Clade)) +
  geom_point(position = "jitter") + 
  facet_wrap(~species)

  • Clade A is far more prevalent as is also shown in the number of abservations per Clade based on these data.

  • Distributions are different per species. ASV1 and 2 are more clustered for species P and more spread out in terms of counts for species S.

  • From coseq() K-means clustering, this is perhaps why ASV1 and 2 were consistently clustered together in species P but not for species S. Perhaps ASVs 1 and 2 play a different role in species S.

Data Subsetting of first 1000 most abundant ASVs per Species

Species P Data Subsetting - Which ASVs are not found in at least 3 samples for Species P?

corals %>% 
  filter(species =="P") %>%  # 296:  cut off will be 293 inclusive
  select(-c(1:8)) -> corals_p # saving in order to pass to map() function

### These are the ASVs that are found in at least 3 samples or more for species P
map(corals_p, ~sum(.==0)) %>% 
  as_tibble() %>% 
  pivot_longer(cols = everything(), 
               names_to = "ASV_column", 
               values_to = "sum_col_eq_0") %>% 
  arrange(sum_col_eq_0) %>% 
  # cut off includes 293 or more because 296-293 = 3 
  filter(sum_col_eq_0 >=293) -> asvs_speciesP
  • Which ASVs are NOT found in Species P? ASV3
map(corals_p, ~sum(.==0)) %>% 
  as_tibble() %>% 
  pivot_longer(cols = everything(), 
               names_to = "ASV_column", 
               values_to = "sum_col_eq_0") %>% 
  # cut off includes 293 or more because 296-293 = 3 
  filter(sum_col_eq_0 == 0)
## # A tibble: 1 × 2
##   ASV_column sum_col_eq_0
##   <chr>             <int>
## 1 ASV0003               0
  • Take out ASV3 for species P data subset
# use original data frame to get other columns back into data subset
corals %>% 
  filter(species == "P") %>% 
  # take out asv 3 bc we know it's not found in this species
  # take out species column 
  select(-ASV0003, -species) %>% 
  group_by(Clade  ) %>% 
  summarise(n = n())
## # A tibble: 2 × 2
##   Clade     n
##   <chr> <int>
## 1 A       283
## 2 C        13
  • Note: 13/283 is about 5 %

Species S Data Subsetting - Which ASVs are not found in at least 3 samples for Species S?

corals %>% 
  filter(species =="S") %>%  # 362:  cut off will be 362-3 =  359 inclusive
  select(-c(1:8)) -> corals_s # saving here in order to pass to map() function

### These are the ASVs that are found in at least 3 samples or more for species S
map(corals_s, ~sum(.==0)) %>% 
  as_tibble() %>% 
  pivot_longer(cols = everything(), 
               names_to = "ASV_column", 
               values_to = "sum_col_eq_0") %>% 
  arrange(sum_col_eq_0) %>% 
  # cut off includes 293 or more because 296-293 = 3 
  filter(sum_col_eq_0 >=359) -> asvs_speciesS
  • Which ASVs are NOT found in Species S ? All 5,889 ASVs are found in Species S.
map(corals_s, ~sum(.==0)) %>% 
  as_tibble() %>% 
  pivot_longer(cols = everything(), 
               names_to = "ASV_column", 
               values_to = "sum_col_eq_0") %>% 
  # cut off includes 359 or more 
  filter(sum_col_eq_0 == 0)
## # A tibble: 0 × 2
## # ℹ 2 variables: ASV_column <chr>, sum_col_eq_0 <int>
dim(corals_s)
## [1]  362 5889
head(corals_s)
## # A tibble: 6 × 5,889
##   ASV0001 ASV0002 ASV0003 ASV0004 ASV0005 ASV0006 ASV0007 ASV0008 ASV0009
##     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1    2544       0       0       0       0    8018     920       0     847
## 2      38      39       0      55       0   23052       0       0       0
## 3     113       0       0       0       0     270       0       0       0
## 4   28881       0       0       0       0   21607    8200       0    6895
## 5       0       0      64       0       0   10263       0       0       0
## 6   10306       0       0       0       0     611    3533       0    2900
## # ℹ 5,880 more variables: ASV0010 <dbl>, ASV0011 <dbl>, ASV0012 <dbl>,
## #   ASV0013 <dbl>, ASV0014 <dbl>, ASV0015 <dbl>, ASV0016 <dbl>, ASV0017 <dbl>,
## #   ASV0018 <dbl>, ASV0019 <dbl>, ASV0020 <dbl>, ASV0022 <dbl>, ASV0023 <dbl>,
## #   ASV0024 <dbl>, ASV0025 <dbl>, ASV0027 <dbl>, ASV0030 <dbl>, ASV0031 <dbl>,
## #   ASV0032 <dbl>, ASV0033 <dbl>, ASV0034 <dbl>, ASV0035 <dbl>, ASV0036 <dbl>,
## #   ASV0037 <dbl>, ASV0039 <dbl>, ASV0040 <dbl>, ASV0041 <dbl>, ASV0042 <dbl>,
## #   ASV0043 <dbl>, ASV0044 <dbl>, ASV0046 <dbl>, ASV0047 <dbl>, …

Visualization Experiment with t-SNE Plot:

  • t-SNE is a way to visualize high-dimensional data and is recommended for hundreds of thousands of ASVs, recommended by Teaching Assistant for Data 793.
  • On a separate occassion previously, Client indicated that it is also possible to use first 1000 ASVs as these are already ranked in terms of the abundance they were found in the data.

Code adapted from here

library(Rtsne)
  • For t-SNE, select top 1000 ASVs no matter species to test out visualization
corals %>% 
  select(starts_with("ASV")[1:1000]) %>% 
  as.matrix() -> asv_matrix

Example t-SNE visualization

library(tsne)
library(plotly)
data("iris")

features <- subset(iris, select = -c(Species)) 

set.seed(0)
tsne <- tsne(features, initial_dims = 2)
## sigma summary: Min. : 0.376979658833158 |1st Qu. : 0.45299119244845 |Median : 0.509480199794486 |Mean : 0.520650714341092 |3rd Qu. : 0.579571467464058 |Max. : 0.758492715638686 |
## Epoch: Iteration #100 error is: 10.6877963361652
## Epoch: Iteration #200 error is: 0.0730593518613691
## Epoch: Iteration #300 error is: 0.0688210768935925
## Epoch: Iteration #400 error is: 0.0677845552661532
## Epoch: Iteration #500 error is: 0.0674575742993366
## Epoch: Iteration #600 error is: 0.0672632485595604
## Epoch: Iteration #700 error is: 0.0671387737166583
## Epoch: Iteration #800 error is: 0.0670575761754511
## Epoch: Iteration #900 error is: 0.0670020352301366
## Epoch: Iteration #1000 error is: 0.0669657774358675
tsne <- data.frame(tsne)
pdb <- cbind(tsne,iris$Species)
options(warn = -1)
fig <-  plot_ly(data = pdb ,x =  ~X1, y = ~X2, type = 'scatter', mode = 'markers', split = ~iris$Species)

fig <- fig %>%
  layout(
    plot_bgcolor = "#e5ecf6"
  )

fig
dim(asv_matrix)
## [1]  658 1000

t-SNE on coral samples and top 1000 ASVs - Both Species

library(tsne)
library(plotly)


features <- subset(asv_matrix) # 658 x 1000

set.seed(0)
tsne <- tsne(features, initial_dims = 2)
## sigma summary: Min. : 0.0717351762735381 |1st Qu. : 0.14417917387887 |Median : 0.25304682961265 |Mean : 0.259781738918223 |3rd Qu. : 0.331287316162124 |Max. : 0.881559216277751 |
## Epoch: Iteration #100 error is: 9.5805974100228
## Epoch: Iteration #200 error is: 0.215750400209934
## Epoch: Iteration #300 error is: 0.167603391129179
## Epoch: Iteration #400 error is: 0.152457463386035
## Epoch: Iteration #500 error is: 0.146676402504261
## Epoch: Iteration #600 error is: 0.143524659317524
## Epoch: Iteration #700 error is: 0.14148496677019
## Epoch: Iteration #800 error is: 0.140058690361463
## Epoch: Iteration #900 error is: 0.1390071568364
## Epoch: Iteration #1000 error is: 0.138210903902764
tsne <- data.frame(tsne)
pdb <- cbind(tsne,corals$species)
options(warn = -1)
fig <-  plot_ly(data = pdb ,x =  ~X1, y = ~X2, type = 'scatter', mode = 'markers', split = ~corals$species)

fig <- fig %>%
  layout(
    plot_bgcolor = "#e5ecf6"
  )

fig

t-SNE for Clade - Both Species - Top 1000 ASVs

library(tsne)
library(plotly)

features <- subset(asv_matrix) 

set.seed(0)
tsne <- tsne(features, initial_dims = 2)
## sigma summary: Min. : 0.0717351762735381 |1st Qu. : 0.14417917387887 |Median : 0.25304682961265 |Mean : 0.259781738918223 |3rd Qu. : 0.331287316162124 |Max. : 0.881559216277751 |
## Epoch: Iteration #100 error is: 9.5805974100228
## Epoch: Iteration #200 error is: 0.215750400209934
## Epoch: Iteration #300 error is: 0.167603391129179
## Epoch: Iteration #400 error is: 0.152457463386035
## Epoch: Iteration #500 error is: 0.146676402504261
## Epoch: Iteration #600 error is: 0.143524659317524
## Epoch: Iteration #700 error is: 0.14148496677019
## Epoch: Iteration #800 error is: 0.140058690361463
## Epoch: Iteration #900 error is: 0.1390071568364
## Epoch: Iteration #1000 error is: 0.138210903902764
tsne <- data.frame(tsne)
pdb <- cbind(tsne,corals$Clade)
options(warn = -1)
fig2 <-  plot_ly(data = pdb ,x =  ~X1, y = ~X2, type = 'scatter', mode = 'markers', split = ~corals$Clade)

fig2 <- fig2 %>%
  layout(
    plot_bgcolor = "#e5ecf6"
  )

fig2

t-SNE for Species P

# library(tsne)
# library(plotly)
# 
# 
# 
# features <- as.matrix(corals_p, select..........)
# 
# set.seed(0)
# tsne <- tsne(features, initial_dims = 2)
# tsne <- data.frame(tsne)
# pdb <- cbind(tsne,corals$Clade)
# options(warn = -1)
# fig <-  plot_ly(data = pdb ,x =  ~X1, y = ~X2, type = 'scatter', mode = 'markers', split = ~corals$Clade)
# 
# fig <- fig %>%
#   layout(
#     plot_bgcolor = "#e5ecf6"
#   )
# 
# fig

GGpairs on top 10 ASVs and Clade for Species P

corals[, -c(1,3,4,5,7,8) ] %>% 
  filter(species=="P") %>% 
  select(1, 3:12) %>% 
  GGally::ggpairs(aes(color = Clade)) -> ggpairs_species_P
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs_species_P
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggsave( "../plots/ggpairs_species_P.png", ggpairs_species_P, width = 14)
## Saving 14 x 5 in image
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

GGpairs on top 10 ASVs and Clade for Species P

corals[, -c(1,3,4,5,7,8) ] %>% 
  filter(species=="S") %>% 
  select(1, 3:12) %>% 
  GGally::ggpairs(aes(color = Clade)) -> ggpairs_species_S

ggpairs_species_S
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggsave( "../plots/ggpairs_species_S.png", ggpairs_species_S, width = 14)
## Saving 14 x 5 in image
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.